pacman::p_load(sf, sfdep, tmap, tidyverse, knitr,
patchwork, DT, mapview, leaflet, scales)
# - Creates a package list containing the necessary R packages
# - Checks if the R packages in the package list have been installed
# - If not installed, will install the missing packages & load them into the R environment.

Take-home Exercise 1: Geospatial Analytics for Public Good
1 Overview
1.1 Background
As urban infrastructures, including public transportation systems like buses, taxis, mass rapid transit, public utilities, and roads, become increasingly digitised, the data generated becomes a valuable resource for tracking the movements of people and vehicles over space and time. This transformation has been facilitated by pervasive computing technologies such as Global Positioning System (GPS) and Radio Frequency Identification (RFID) tags on vehicles. An example of this is the collection of data on bus routes and ridership, amassed from the use of smart cards and GPS devices available on public buses.
The data collected from these sources is likely to contain valuable patterns that offer insights into various aspects of human movement and behavior within a city. Analyzing and comparing these patterns can provide a deeper understanding of urban mobility. Such insights can be instrumental in improving urban management and can also serve as valuable information for both public and private sector stakeholders involved in urban transportation services. This information can aid them in making informed decisions and gaining a competitive edge in their respective domains.
1.2 Objectives
The objective of this study is to utilize appropriate Local Indicators of Spatial Association (LISA) and Emerging Hot Spot Analysis (EHSA) techniques to uncover spatial and spatio-temporal mobility patterns among public bus passengers in Singapore.
This will include the following tasks:
- Geovisualisation and Analysis:
- Compute the passenger trips generated by origin at the hexagon level
- Display the geographical distribution of the passenger trips
- Explore spatial patterns revealed by the geovisualisation
| Peak hour period | Bus tap on time |
|---|---|
| Weekday morning peak | 6am to 9am |
| Weekday afternoon peak | 5pm to 8pm |
| Weekend/holiday morning peak | 11am to 2pm |
| Weekend/holiday evening peak | 4pm to 7pm |
- Local Indicators of Spatial Association (LISA) Analysis:
- Compute LISA of the passenger trips generated by origin
- Display LISA maps and draw statistical conclusions from them
- Emerging Hot Spot Analysis (EHSA)
- Perform Mann-Kendall Test by using the spatio-temporal local Gi* values
- Display EHSA maps of the Gi* values, describe the spatial patterns revealed
2 Loading Packages
The following packages will be used for this exercise:
The code chunk at the top of this document, using the p_load function of the pacman package, ensures that the sfdep, sf, tmap, tidyverse and other required packages are installed and loaded into R.
3 Data Preparation
3.1 The Data
The following data are used for this study:
- Aspatial:
- Passenger Volume by Origin Destination Bus Stops for August, September and October 2023, downloaded from LTA DataMall using API.
- Geospatial
- Bus Stop Location from LTA DataMall. It provides information about all the bus stops currently being serviced by buses, including the bus stop code (identifier) and location coordinates.
- hexagon, a hexagonal layer of 250m is provided to replace the relatively coarse and irregular Master Plan 2019 Planning Sub-zone GIS data set of URA.
3.2 Import & Preparation
3.2.1 Aspatial
::: panel-tabset
Import into R
We will be importing the Passenger Volume by Origin Destination Bus Stops data sets for August to October 2023, downloaded from LTA DataMall, by using read_csv() of the readr package.
odbus08 <- read_csv("data/aspatial/origin_destination_bus_202308.csv")
#odbus09 <- read_csv("data/aspatial/origin_destination_bus_202309.csv")
#odbus10 <- read_csv("data/aspatial/origin_destination_bus_202310.csv")

Data Exploration
(a) Attributes
glimpse() of the dplyr package allows us to see all columns and their data type in the data frame.
glimpse(odbus08)

Rows: 5,709,512
Columns: 7
$ YEAR_MONTH <chr> "2023-08", "2023-08", "2023-08", "2023-08", "2023-…
$ DAY_TYPE <chr> "WEEKDAY", "WEEKENDS/HOLIDAY", "WEEKENDS/HOLIDAY",…
$ TIME_PER_HOUR <dbl> 16, 16, 14, 14, 17, 17, 17, 17, 7, 17, 14, 10, 10,…
$ PT_TYPE <chr> "BUS", "BUS", "BUS", "BUS", "BUS", "BUS", "BUS", "…
$ ORIGIN_PT_CODE <chr> "04168", "04168", "80119", "80119", "44069", "4406…
$ DESTINATION_PT_CODE <chr> "10051", "10051", "90079", "90079", "17229", "1722…
$ TOTAL_TRIPS <dbl> 7, 2, 3, 10, 5, 4, 3, 22, 3, 3, 7, 1, 3, 1, 3, 1, …
#glimpse(odbus09)
#glimpse(odbus10)

Insights:
- There are 7 variables in the odbus08 tibble data, they are:
- YEAR_MONTH: Month in which data is collected
- DAY_TYPE: Weekdays or weekends/holidays
- TIME_PER_HOUR: Hour which the passenger trip is based on, in intervals from 0 to 23 hours
- PT_TYPE: Type of public transport, i.e. bus
- ORIGIN_PT_CODE: Origin bus stop ID
- DESTINATION_PT_CODE: Destination bus stop ID
- TOTAL_TRIPS: Number of trips
- We also note that values in ORIGIN_PT_CODE and DESTINATION_PT_CODE are in character data type.
(b) Unique Bus Stops
n_distinct() of the dplyr package allows us to count the unique bus stops in the data set.
n_distinct(odbus08$ORIGIN_PT_CODE)

[1] 5067
The results reveal that there are 5067 distinct origin bus stops.
Data Wrangling
(a) Convert Data Type
as.factor() can be used to convert the variables ORIGIN_PT_CODE and DESTINATION_PT_CODE from character to categorical (factor) data type. We use glimpse() again to check the results.
odbus08$ORIGIN_PT_CODE <- as.factor(odbus08$ORIGIN_PT_CODE)
odbus08$DESTINATION_PT_CODE <- as.factor(odbus08$DESTINATION_PT_CODE)
glimpse(odbus08)

Rows: 5,709,512
Columns: 7
$ YEAR_MONTH <chr> "2023-08", "2023-08", "2023-08", "2023-08", "2023-…
$ DAY_TYPE <chr> "WEEKDAY", "WEEKENDS/HOLIDAY", "WEEKENDS/HOLIDAY",…
$ TIME_PER_HOUR <dbl> 16, 16, 14, 14, 17, 17, 17, 17, 7, 17, 14, 10, 10,…
$ PT_TYPE <chr> "BUS", "BUS", "BUS", "BUS", "BUS", "BUS", "BUS", "…
$ ORIGIN_PT_CODE <fct> 04168, 04168, 80119, 80119, 44069, 44069, 20281, 2…
$ DESTINATION_PT_CODE <fct> 10051, 10051, 90079, 90079, 17229, 17229, 20141, 2…
$ TOTAL_TRIPS <dbl> 7, 2, 3, 10, 5, 4, 3, 22, 3, 3, 7, 1, 3, 1, 3, 1, …
Note that both of them are in factor data type now.
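As an aside, the same conversion can be written more compactly with mutate() and across() of dplyr. A minimal sketch on a hypothetical toy tibble (column names chosen to mirror odbus08, not the real data):

```r
library(dplyr)

# Toy tibble mimicking the two bus stop code columns of odbus08
toy <- tibble(
  ORIGIN_PT_CODE      = c("04168", "80119"),
  DESTINATION_PT_CODE = c("10051", "90079")
)

# Convert every column ending in "_PT_CODE" to factor in one step
toy <- toy %>%
  mutate(across(ends_with("_PT_CODE"), as.factor))

sapply(toy, class)
```

This scales better if more code columns are added later, since the selection is by name pattern rather than by repeating each column.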
(b) Duplicates Check
Before moving on to the next step, it is a good practice for us to check for duplicated records to prevent double counting of passenger trips.
duplicate <- odbus08 %>%
group_by_all() %>%
filter(n()>1) %>%
ungroup()
duplicate

# A tibble: 0 × 7
# ℹ 7 variables: YEAR_MONTH <chr>, DAY_TYPE <chr>, TIME_PER_HOUR <dbl>,
# PT_TYPE <chr>, ORIGIN_PT_CODE <fct>, DESTINATION_PT_CODE <fct>,
# TOTAL_TRIPS <dbl>
Results confirm that there are no duplicated records found.
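A quick base R cross-check uses duplicated(), which flags every repeat of an earlier row. Shown here on a hypothetical toy data frame rather than the full odbus08:

```r
# Toy data frame with one exact duplicate row (rows 1 and 2)
toy <- data.frame(
  ORIGIN_PT_CODE = c("04168", "04168", "80119"),
  TOTAL_TRIPS    = c(7, 7, 3)
)

# Count of surplus duplicate rows; 0 would mean no duplicates
sum(duplicated(toy))   # 1
```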
(c) Extracting the Study Data
In our study, we are interested in patterns for 4 peak hour periods. Therefore, we create a new variable period using ifelse() that states which peak period, if any, an observation falls into, using the code chunk below.
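The nested ifelse() chain below can equivalently be expressed with dplyr::case_when(), which reads top to bottom and is easier to maintain. A sketch on hypothetical toy rows (not the full data set), using the same period cut-offs:

```r
library(dplyr)

toy <- tibble(
  DAY_TYPE      = c("WEEKDAY", "WEEKDAY", "WEEKENDS/HOLIDAY"),
  TIME_PER_HOUR = c(7, 12, 13)
)

toy <- toy %>%
  mutate(period = case_when(
    DAY_TYPE == "WEEKDAY" & between(TIME_PER_HOUR, 6, 9)            ~ "WDM",
    DAY_TYPE == "WEEKDAY" & between(TIME_PER_HOUR, 17, 20)          ~ "WDA",
    DAY_TYPE == "WEEKENDS/HOLIDAY" & between(TIME_PER_HOUR, 11, 14) ~ "WEM",
    DAY_TYPE == "WEEKENDS/HOLIDAY" & between(TIME_PER_HOUR, 16, 19) ~ "WEE",
    TRUE                                                            ~ "Off-peak"
  ))

toy$period   # "WDM" "Off-peak" "WEM"
```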
peak <- odbus08 %>%
# Weekday morning peak
mutate(period= ifelse(DAY_TYPE=="WEEKDAY" & (TIME_PER_HOUR >= 6 & TIME_PER_HOUR <= 9), "WDM",
# Weekday afternoon peak
ifelse(DAY_TYPE=="WEEKDAY" & (TIME_PER_HOUR >= 17 & TIME_PER_HOUR <= 20), "WDA",
# Weekend/holiday morning peak
ifelse(DAY_TYPE=="WEEKENDS/HOLIDAY" & (TIME_PER_HOUR >= 11 & TIME_PER_HOUR <= 14), "WEM",
# Weekend/holiday evening peak
ifelse(DAY_TYPE=="WEEKENDS/HOLIDAY" & (TIME_PER_HOUR >= 16 & TIME_PER_HOUR <= 19), "WEE",
# Return off-peak if neither of the peak hour periods
"Off-peak")))))We can then filter for peak-period data using the newly created period column and aggregate the total trips for each origin bus stop during peak period.
peakperiods <- peak %>%
# filter() keeps records that occurred during peak periods
filter(period !="Off-peak") %>%
# aggregate the total passenger trips for each origin bus stop
group_by(period, ORIGIN_PT_CODE) %>%
summarise(TRIPS=sum(TOTAL_TRIPS))

Let’s visualise the proportions of passenger volumes for each peak period.
Show the code
freq<- ggplot(data=peakperiods,
aes(x=period,y=TRIPS))+
geom_bar(stat="identity") +
theme(legend.position="none")+
labs(title = "Frequency of Trip for each Peak Period",
x = "Peak Period",
y = "Frequency")
freq + scale_y_continuous(labels=label_comma())
We can see that passenger volumes on weekdays are much higher than over the weekends/holidays.
Transposing each peak period into a column using pivot_wider() of the tidyr package allows us to create further variables at the bus stop level. We replace NA values with 0 to reflect periods with no traffic.
peakperiods_wide <- pivot_wider(peakperiods,
names_from = "period",
values_from = "TRIPS")
peakperiods_wide["WDA"][is.na(peakperiods_wide["WDA"])] <- 0
peakperiods_wide["WDM"][is.na(peakperiods_wide["WDM"])] <- 0
peakperiods_wide["WEE"][is.na(peakperiods_wide["WEE"])] <- 0
peakperiods_wide["WEM"][is.na(peakperiods_wide["WEM"])] <- 0
glimpse(peakperiods_wide)

Rows: 5,067
Columns: 5
$ ORIGIN_PT_CODE <fct> 01012, 01013, 01019, 01029, 01039, 01059, 01109, 01112,…
$ WDA <dbl> 8448, 7328, 3608, 9317, 12937, 2133, 322, 45010, 27233,…
$ WDM <dbl> 1973, 952, 1789, 2561, 2938, 1651, 161, 8492, 9015, 424…
$ WEE <dbl> 3208, 2796, 1623, 4244, 7403, 1190, 88, 21706, 11927, 6…
$ WEM <dbl> 2273, 1697, 1511, 3272, 5424, 1062, 89, 14964, 8278, 61…
Notice that there are 5067 unique origin bus stops.
(d) Variable Transformation
Show the code
# Extract column
distWDM <- peakperiods_wide$WDM
# Calculate mean
distWDM_mean <- mean(distWDM)
plot_distWDM <- ggplot(
data = data.frame(distWDM),
aes(x = distWDM)
) +
geom_histogram(
bins = 20,
color = "#FFFCF9",
fill = "black",
alpha = .3
) +
# Add line for mean
geom_vline(
xintercept = distWDM_mean,
color = "#595DE5",
linetype = "dashed",
linewidth = 1
) +
# Add line annotations
annotate(
"text",
x = 80000,
y = 2000,
label = paste("Mean =", round(distWDM_mean, 3)),
color = "#595DE5",
size = 3
) +
labs(
title = "Weekday Morning Peak",
x = "Bus Trips",
y = "Frequency"
) +
theme(
axis.text.x = element_text(angle = 45, hjust=1)
) +
scale_x_continuous(labels = label_number(), n.breaks=8)
# Extract column
distWDA <- peakperiods_wide$WDA
# Calculate mean
distWDA_mean <- mean(distWDA)
plot_distWDA <- ggplot(
data = data.frame(distWDA),
aes(x = distWDA)
) +
geom_histogram(
bins = 20,
color = "#FFFCF9",
fill = "black",
alpha = .3
) +
# Add line for mean
geom_vline(
xintercept = distWDA_mean,
color = "#595DE5",
linetype = "dashed",
linewidth = 1
) +
# Add line annotations
annotate(
"text",
x = 110000,
y = 2000,
label = paste("Mean =", round(distWDA_mean, 3)),
color = "#595DE5",
size = 3
) +
labs(
title = "Weekday Afternoon Peak",
x = "Bus Trips",
y = "Frequency"
) +
theme(
axis.text.x = element_text(angle = 45, hjust=1)
) +
scale_x_continuous(labels = label_number(), n.breaks=8)
# Extract column
distWEM <- peakperiods_wide$WEM
# Calculate mean
distWEM_mean <- mean(distWEM)
plot_distWEM <- ggplot(
data = data.frame(distWEM),
aes(x = distWEM)
) +
geom_histogram(
bins = 20,
color = "#FFFCF9",
fill = "black",
alpha = .3
) +
# Add line for mean
geom_vline(
xintercept = distWEM_mean,
color = "#595DE5",
linetype = "dashed",
linewidth = 1
) +
# Add line annotations
annotate(
"text",
x = 23000,
y = 2000,
label = paste("Mean =", round(distWEM_mean, 3)),
color = "#595DE5",
size = 3
) +
labs(
title = "Weekdend/Holiday Morning Peak",
x = "Bus Trips",
y = "Frequency"
) +
theme(
axis.text.x = element_text(angle = 45, hjust=1)
) +
scale_x_continuous(labels = label_number(), n.breaks=8)
# Extract column
distWEE <- peakperiods_wide$WEE
# Calculate mean
distWEE_mean <- mean(distWEE)
plot_distWEE <- ggplot(
data = data.frame(distWEE),
aes(x = distWEE)
) +
geom_histogram(
bins = 20,
color = "#FFFCF9",
fill = "black",
alpha = .3
) +
# Add line for mean
geom_vline(
xintercept = distWEE_mean,
color = "#595DE5",
linetype = "dashed",
linewidth = 1
) +
# Add line annotations
annotate(
"text",
x = 29000,
y = 2000,
label = paste("Mean =", round(distWEE_mean, 3)),
color = "#595DE5",
size = 3
) +
labs(
title = "Weekend/Holiday Evening Peak",
x = "Bus Trips",
y = "Frequency"
) +
theme(
axis.text.x = element_text(angle = 45, hjust=1)
) +
scale_x_continuous(labels = label_number(), n.breaks=8)
(plot_distWDM | plot_distWDA)/
(plot_distWEM | plot_distWEE)
The distributions of passenger trips for the 4 peak periods appear to be highly skewed to the right. Rescaling our data with a log transformation can greatly reduce the skewness.
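Since some origin stops may record zero trips in a period, log() alone would produce -Inf; the chunk below guards against this with ifelse(). An alternative worth noting is log1p(), i.e. log(1 + x), which maps 0 to 0 without branching, at the cost of a slightly shifted scale. A small comparison on hypothetical values:

```r
x <- c(0, 1, 100, 45010)

# ifelse() guard, as used below: leave zeros untouched
log_guarded <- ifelse(x == 0, 0, log(x))

# log1p() alternative: log(1 + x), defined at zero
log_shifted <- log1p(x)

cbind(x, log_guarded, log_shifted)
```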
peakperiods_wider <- peakperiods_wide %>%
mutate(logWDM = ifelse(WDM == 0, 0, log(WDM)),
logWDA = ifelse(WDA == 0, 0, log(WDA)),
logWEM = ifelse(WEM == 0, 0, log(WEM)),
logWEE = ifelse(WEE == 0, 0, log(WEE)))

Let’s visualise the distribution of the 4 peak periods again.
Show the code
# Extract column
distlogWDM <- peakperiods_wider$logWDM
# Calculate mean
distlogWDM_mean <- mean(distlogWDM)
plot_distlogWDM <- ggplot(
data = data.frame(distlogWDM),
aes(x = distlogWDM)
) +
geom_histogram(
bins = 20,
color = "#FFFCF9",
fill = "#34414E",
alpha = .3
) +
# Add line for mean
geom_vline(
xintercept = distlogWDM_mean,
color = "#595DE5",
linetype = "dashed",
linewidth = 1
) +
# Add line annotations
annotate(
"text",
x = 10,
y = 1000,
label = paste("Mean =", round(distlogWDM_mean, 3)),
color = "#595DE5",
size = 3
) +
labs(
title = "Weekday Morning Peak",
x = "Bus Trips",
y = "Frequency"
) +
theme(
axis.text.x = element_text(angle = 45, hjust=1)
)
# Extract column
distlogWDA <- peakperiods_wider$logWDA
# Calculate mean
distlogWDA_mean <- mean(distlogWDA)
plot_distlogWDA <- ggplot(
data = data.frame(distlogWDA),
aes(x = distlogWDA)
) +
geom_histogram(
bins = 20,
color = "#FFFCF9",
fill = "#34414E",
alpha = .3
) +
# Add line for mean
geom_vline(
xintercept = distlogWDA_mean,
color = "#595DE5",
linetype = "dashed",
linewidth = 1
) +
# Add line annotations
annotate(
"text",
x = 10,
y = 1000,
label = paste("Mean =", round(distlogWDA_mean, 3)),
color = "#595DE5",
size = 3
) +
labs(
title = "Weekday Afternoon Peak",
x = "Bus Trips",
y = "Frequency"
) +
theme(
axis.text.x = element_text(angle = 45, hjust=1)
)
# Extract column
distlogWEM <- peakperiods_wider$logWEM
# Calculate mean
distlogWEM_mean <- mean(distlogWEM)
plot_distlogWEM <- ggplot(
data = data.frame(distlogWEM),
aes(x = distlogWEM)
) +
geom_histogram(
bins = 20,
color = "#FFFCF9",
fill = "#34414E",
alpha = .3
) +
# Add line for mean
geom_vline(
xintercept = distlogWEM_mean,
color = "#595DE5",
linetype = "dashed",
linewidth = 1
) +
# Add line annotations
annotate(
"text",
x = 10,
y = 1000,
label = paste("Mean =", round(distlogWEM_mean, 3)),
color = "#595DE5",
size = 3
) +
labs(
title = "Weekdend/Holiday Morning Peak",
x = "Bus Trips",
y = "Frequency"
) +
theme(
axis.text.x = element_text(angle = 45, hjust=1)
)
# Extract column
distlogWEE <- peakperiods_wider$logWEE
# Calculate mean
distlogWEE_mean <- mean(distlogWEE)
plot_distlogWEE <- ggplot(
data = data.frame(distlogWEE),
aes(x = distlogWEE)
) +
geom_histogram(
bins = 20,
color = "#FFFCF9",
fill = "#34414E",
alpha = .3
) +
# Add line for mean
geom_vline(
xintercept = distlogWEE_mean,
color = "#595DE5",
linetype = "dashed",
linewidth = 1
) +
# Add line annotations
annotate(
"text",
x = 10,
y = 1000,
label = paste("Mean =", round(distlogWEE_mean, 3)),
color = "#595DE5",
size = 3
) +
labs(
title = "Weekend/Holiday Evening Peak",
x = "Bus Trips",
y = "Frequency"
) +
theme(
axis.text.x = element_text(angle = 45, hjust=1)
)
(plot_distlogWDM | plot_distlogWDA)/
(plot_distlogWEM | plot_distlogWEE)
3.2.2 Geospatial
(a) Bus Stop Shapefile
In this section, we import the BusStop shapefile into RStudio using the st_read() function of the sf package. This data provides the locations of all bus stops as at Q2 2023. crs = 3414 sets the coordinate reference system (CRS) to EPSG:3414, the SVY21 projected coordinate system used in Singapore.
busstop <- st_read(dsn = "data/geospatial",
layer = "BusStop") %>%
st_transform(crs = 3414)

Reading layer `BusStop' from data source
`C:\kytjy\ISSS624\Take-Home_Ex\Take-Home_Ex1\data\geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 5161 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 3970.122 ymin: 26482.1 xmax: 48284.56 ymax: 52983.82
Projected CRS: SVY21
The imported shapefile is a simple features object of sf. From the output, we can see that there are 5161 points with 3 fields, and confirm that the projected CRS SVY21 is correct.
Recall that there are 5067 origin bus stops in the peakperiods_wider table, compared to the 5161 bus stops in LTA’s BusStop shapefile. This could be due to a timing difference – LTA’s BusStop shapefile is as of July 2023, while the ridership data is based on August 2023.
mapview::mapview(busstop)

Note that there are 5 bus stops located outside Singapore; they are bus stops 46239, 46609, 47701, 46211, and 46219.
(b) Hexagon Layer
A hexagonal grid is used to replace the relatively coarse and irregular Master Plan 2019 Planning Sub-zone GIS data set of URA. Hexagons have a number of advantages over other tessellation shapes:
- The distance between the centroid of a hexagon to all neighboring centroids is the same in all directions.
- The lack of acute angles in a regular hexagon means that no areas of the shape are outliers in any direction.
- All neighboring hexagons have the same spatial relationship with the central hexagon, making spatial querying and joining a more straightforward process.
- Unlike square-based grids, the geometry of hexagons is well-suited to representing curved geographic features, such as rivers and roads, which are rarely perpendicular in shape.
- The “softer” shape of a hexagon compared to a square means it performs better at representing gradual spatial changes.
We first create a hexagonal grid layer of 250m (the perpendicular distance between the centre of a hexagon and each of its edges) with st_make_grid(), and use st_sf() to convert the grid into an sf object, with the code below.
The st_make_grid function creates a grid over a spatial object. It takes 4 arguments:

- x: sf object; the input spatial data
- cellsize: for hexagonal cells, the distance between opposite edges, in the units of the CRS of the spatial data. In this case, we take cellsize to be 250m * 2 = 500m
- what: character; one of "polygons", "corners", or "centers"
- square: whether to create a square grid (TRUE) or a hexagonal grid (FALSE)
area_hexagon_grid = st_make_grid(busstop, 500, what = "polygons", square = FALSE)

Next, st_sf() converts the grid created to an sf object, while lengths() of base R is used to calculate the number of grids created.
# Converts grid to sf
hexagon_grid_sf = st_sf(area_hexagon_grid) %>%
# Assign unique ID to each grid
mutate(grid_id = 1:length(lengths(area_hexagon_grid)))

We count the number of bus stops in each grid and keep grids with bus stops using the code chunks below.
# Create a column containing the count of bus stops in each grid
hexagon_grid_sf$busstops = lengths(st_intersects(hexagon_grid_sf, busstop))
# Keep only hexagons that contain at least one bus stop
hexagon_w_busstops = filter(hexagon_grid_sf, busstops > 0)Let’s confirm that all bus stops have been accounted for in our hexagon layer.
sum(hexagon_w_busstops$busstops)

[1] 5161
This is in line with the 5161 points of the busstop shapefile.
Lastly, using tm_shape of tmap, we can quickly visualise the results of the hexagon grids we have created.
Show the code
tmap_mode ("view")
hex <- tm_shape(hexagon_w_busstops)+
tm_fill(
col = "busstops",
palette = "Blues",
style = "cont",
title = "Number of Bus Stops",
id = "grid_id",
showNA = FALSE,
alpha = 0.6,
popup.format = list(
grid_id = list(format = "f", digits = 0)
)
)+
tm_borders(col = "grey40", lwd = 0.7)
hex

(c) Combining Bus Stop and Hexagon Layers
Code chunk below populates the grid ID (i.e. grid_id) of hexagon_w_busstops sf data frame into busstop sf data frame.
bs_wgrids <- st_intersection(busstop, hexagon_w_busstops) %>%
  select(BUS_STOP_N, BUS_ROOF_N, LOC_DESC, grid_id, busstops) %>%
  st_drop_geometry()

- st_intersection() is used to perform a point-and-polygon overlay, and the output is a point sf object.
- select() of the dplyr package is then used to retain only the BUS_STOP_N, BUS_ROOF_N, LOC_DESC, grid_id and busstops columns.
- st_drop_geometry() removes the geometry column so that the result can be manipulated like a regular data frame using tidyr and dplyr functions.
Before we proceed, let’s perform a duplicates check on bs_wgrids.
duplicate2 <- bs_wgrids %>%
group_by_all() %>%
filter(n()>1) %>%
ungroup()
duplicate2

# A tibble: 8 × 5
BUS_STOP_N BUS_ROOF_N LOC_DESC grid_id busstops
<chr> <chr> <chr> <int> <int>
1 43709 B06 BLK 644 1904 7
2 43709 B06 BLK 644 1904 7
3 58031 UNK OPP CANBERRA DR 2939 7
4 58031 UNK OPP CANBERRA DR 2939 7
5 51071 B21 MACRITCHIE RESERVOIR 3081 6
6 51071 B21 MACRITCHIE RESERVOIR 3081 6
7 97079 B14 OPP ST. JOHN'S CRES 5037 5
8 97079 B14 OPP ST. JOHN'S CRES 5037 5
The results display 4 sets of genuinely duplicated records (8 rows in total). We remove these to prevent double-counting.
The code chunk below helps retain unique records.
bs_wgrids <- unique(bs_wgrids)

(d) Populate PeakPeriods with Grid Details
We can now append the grid ID from the bs_wgrids data frame onto the peakperiods_wider data frame. Recall that we previously identified 5 bus stops outside Singapore; filter() allows us to exclude them.
origin_grid <- left_join(peakperiods_wider, bs_wgrids,
by = c("ORIGIN_PT_CODE" = "BUS_STOP_N")) %>%
rename(ORIGIN_BS = ORIGIN_PT_CODE) %>%
group_by(grid_id) %>%
# retains SG bus stops
filter(!ORIGIN_BS %in% c(46239, 46609, 47701, 46211, 46219))
glimpse(origin_grid)

Rows: 5,076
Columns: 13
Groups: grid_id [1,504]
$ ORIGIN_BS <chr> "01012", "01013", "01019", "01029", "01039", "01059", "0110…
$ WDA <dbl> 8448, 7328, 3608, 9317, 12937, 2133, 322, 45010, 27233, 932…
$ WDM <dbl> 1973, 952, 1789, 2561, 2938, 1651, 161, 8492, 9015, 4240, 5…
$ WEE <dbl> 3208, 2796, 1623, 4244, 7403, 1190, 88, 21706, 11927, 6221,…
$ WEM <dbl> 2273, 1697, 1511, 3272, 5424, 1062, 89, 14964, 8278, 6198, …
$ logWDM <dbl> 7.587311, 6.858565, 7.489412, 7.848153, 7.985484, 7.409136,…
$ logWDA <dbl> 9.041685, 8.899458, 8.190909, 9.139596, 9.467847, 7.665285,…
$ logWEM <dbl> 7.728856, 7.436617, 7.320527, 8.093157, 8.598589, 6.967909,…
$ logWEE <dbl> 8.073403, 7.935945, 7.392032, 8.353261, 8.909641, 7.081709,…
$ BUS_ROOF_N <chr> "B03", "B05", "B04", "B07", "B09", "B08", "TMNL", "B07", "B…
$ LOC_DESC <chr> "HOTEL GRAND PACIFIC", "ST JOSEPH'S CH", "BRAS BASAH CPLX",…
$ grid_id <int> 3292, 3292, 3292, 3323, 3354, 3324, 3324, 3292, 3324, 3292,…
$ busstops <int> 8, 8, 8, 7, 8, 7, 7, 8, 7, 8, 7, 7, 8, 7, 7, 7, 7, 7, 7, 7,…
(e) Retrieve Geometry
origin_gridwgeom <- inner_join(hexagon_w_busstops,
origin_grid,
by = "grid_id")
#origin_gridwgeom <- st_as_sf(origin_gridwgeom)

In order to retain the geospatial properties, the left data frame must be the sf data frame (i.e. hexagon_w_busstops).
4 Geovisualisation & Analysis
4.1 Data Classification
Some classification schemes highlight areas with the highest and/or lowest values, while others create classes that result in a more uniform distribution of colours. When data is sharply skewed or has extreme outliers, it’s important to consider whether the goal is to emphasise those areas or to achieve a more even distribution of colours and sizes.
The main methods of data classification are:

- Quantile: each class contains an equal number of features. It assigns the same number of data values to each class, so there are no empty classes or classes with too few or too many values.
- Jenks/Natural breaks: seeks clumps of values that are clustered together in order to form categories that may reflect meaningful groupings of areas.
- Equal: divides the range of attribute values into equal-sized sub-ranges.
Since our variable is less skewed after log transformation, we can explore various classification methods for visualization. This approach may reveal interesting insights that were not immediately apparent before.
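The difference between the schemes is easy to see on a small, deliberately skewed toy vector: equal intervals slice the value range, while quantiles slice the ranks. (Jenks minimises within-class variance and would need e.g. the classInt package, so it is omitted from this base R sketch.)

```r
x <- c(1, 2, 3, 100)   # hypothetical skewed data

# Equal intervals: split the range [min, max] into 4 equal-width bins
equal_breaks <- seq(min(x), max(x), length.out = 5)

# Quantile intervals: each bin holds the same share of observations
quantile_breaks <- quantile(x, probs = seq(0, 1, 0.25))

equal_breaks      # 1.00  25.75  50.50  75.25 100.00
quantile_breaks   # 1.00   1.75   2.50  27.25 100.00
```

With equal intervals, the three small values all land in the first bin and the outlier dominates the upper bins; with quantiles, each bin receives one value. This mirrors the behaviour of tmap's style = "equal" and style = "quantile" options used in the maps below.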
4.2 Plots
4.2.1 Weekday Morning Peak Period
Show the code
tmap_mode ("view")
plotlogWDM_q <- tm_basemap("CartoDB.Positron") +
tm_shape(origin_gridwgeom) +
tm_fill("logWDM",
style = "quantile",
palette = "Blues",
title = "Total passenger trips",
alpha=0.6,
id="LOC_DESC")
plotlogWDM_q

The grids above are partitioned using quantile intervals. We can observe that bus trips are unevenly distributed across Singapore. There are lighter shades of blue (indicating lower levels of ridership) originating from the edges of the country, particularly in the West, while higher levels of ridership in the North region are indicated by the darker shades of blue.
Bus stops nearer to the residential estates appeared to be popular during the weekday morning peak period:
- West: BLK 821, BLK 252, Sunshine Place
- North: BLK 314
- North-East: BLK 477A, BLK 1, BLK 555, BLK 324
- East: BLK 109, BLK 124, BLK 756
This is likely due to a large number of people commuting from home to their workplaces/schools on weekday mornings.
Higher passenger traffic was noted at the bus stops nearer to MRT stations such as Harbourfront Station, Farrer Road Station, Yio Chu Kang Station, and Admiralty Station. A possible contributing factor could be passengers transferring from the MRT to buses to reach their destinations.
Lastly, Woodlands Checkpoint also demonstrated higher ridership. This could potentially be due to the people commuting across the causeway from Malaysia into the Singapore borders.
Show the code
tmap_mode ("view")
plotlogWDM_e <- tm_basemap("CartoDB.Positron") +
tm_shape(origin_gridwgeom) +
tm_fill("logWDM",
style = "equal",
palette = "Blues",
title = "Total passenger trips",
alpha=0.6,
id="LOC_DESC")
plotlogWDM_e

The map using equal intervals provided slightly different insights. We noted that bus stops located near MRT stations had higher levels of ridership. In particular, more trips originated from Tiong Bahru Station, Buona Vista Station, Tanah Merah Station, Admiralty Station, Harbourfront, and Woodleigh Station. Bus interchanges also appeared to be popular origins, i.e. Bukit Panjang Interchange and Joo Koon Interchange.
In general, more homogeneity is noted using equal intervals – the contrast from hexagon to hexagon is less obvious.
Show the code
tmap_mode ("view")
plotlogWDM_j <- tm_basemap("CartoDB.Positron") +
tm_shape(origin_gridwgeom) +
tm_fill("logWDM",
style = "jenks",
palette = "Blues",
title = "Total passenger trips",
alpha=0.6,
id="LOC_DESC")
plotlogWDM_j

Using the Jenks partitioning method, the results were largely similar to the other two interval classifications. Higher bus ridership was spotted at bus stops within close proximity to MRT stations (Kranji Station, Buona Vista Station, Buangkok Station, Ranggung Station, Farrer Road Station, Stevens Station, Bedok Reservoir Station) and residential estates (Sunshine Place near Tengah, BLK 109 in Bedok, BLK 477A in Sengkang, Bef. BLK 629A in Woodlands, to name a few).
4.2.2 Weekday Afternoon Peak Period
Show the code
tmap_mode ("view")
plotlogWDA_q <- tm_basemap("CartoDB.Positron") +
tm_shape(origin_gridwgeom) +
tm_fill("logWDA",
style = "quantile",
palette = "Reds",
title = "Total passenger trips",
alpha=0.6,
id="LOC_DESC")
plotlogWDA_q

A look at the weekday afternoon ridership using the quantile classification yielded the following insights:
- Ridership from Woodlands Checkpoint remained high.
- Ridership at bus stops in close proximity to MRT stations and at popular bus stops in residential estates remained high.
- More trips originating from institutional areas: Opposite Ngee Ann Poly, Temasek Poly, NIE BLK 2, School of the Arts
- More trips originating from industrial buildings/business parks: North Link Bldg, Aft Senoko Way, Mapletree Business City, Woodlands Auto Hub, Opp Airline Hse, etc.
- More trips originating from hospitals: Yishun Community Hospital, Changi General Hospital
- Seletar Camp also appeared to have high passenger volumes.
- The far West seemed to experience low ridership other than the bus stop opposite Tuas Link Station.
- Southern part of Singapore, consisting of more commercial areas, appeared to be more clustered as illustrated by the density of the darker red hexagons, compared to the weekday morning peak period.
Show the code
tmap_mode ("view")
plotlogWDA_e <- tm_basemap("CartoDB.Positron") +
tm_shape(origin_gridwgeom) +
tm_fill("logWDA",
style = "equal",
palette = "Reds",
title = "Total passenger trips",
alpha=0.6,
id="LOC_DESC")
plotlogWDA_e

Notably, there was a higher concentration of passengers boarding at Serangoon Station, Harbourfront/VivoCity, Tiong Bahru Station, Admiralty Station, and Punggol Station during weekday afternoons according to the equal interval classification method.
Similar to the map for the weekday morning peak period, the equal interval method produced more homogeneous classifications.
Show the code
tmap_mode ("view")
plotlogWDA_j <- tm_basemap("CartoDB.Positron") +
tm_shape(origin_gridwgeom) +
tm_fill("logWDA",
style = "jenks",
palette = "Reds",
title = "Total passenger trips",
alpha=0.6,
id="LOC_DESC")
plotlogWDA_j

The Jenks classification delivered similar insights to the quantile classification, with a higher concentration of ridership observed in the Southern downtown areas.
It also highlighted Opp Airline Hse in the far East as a bus stop with high ridership, something not as visible using the equal interval method.
Alternative modes of commuting might be more popular in the West and North-West regions, as illustrated by the lighter shades of red hexagons.
4.2.3 Weekend/Holiday Morning Peak Period
Show the code
tmap_mode ("view")
plotlogWEM_q <- tm_basemap("CartoDB.Positron") +
tm_shape(origin_gridwgeom) +
tm_fill("logWEM",
style = "quantile",
palette = "Blues",
title = "Total passenger trips",
alpha=0.6,
id="LOC_DESC")
plotlogWEM_q

Generally, the distribution of bus ridership looks varied across the island.
During the weekend/holiday morning peak period, ridership in the far West region remained relatively low. Interestingly, bus trips recorded from the Seletar area appeared to have dipped compared to the weekday peak periods. Buses in these industrial areas could be oriented towards work-related travel, and thus less used on weekends.
Bus stops nearer to housing estates, shopping malls, and Woodlands Checkpoint demonstrated higher levels of weekend morning ridership.
The bus stops with the most boarding passengers are Sunshine Place, Opp BLK 271, BLK 252, Aft. Hasanah Mosque, Buona Vista Station, Harbourfront/VivoCity, Admiralty Station, BLK 555, Bedok Reservoir Station, BLK 22, and BLK 109.
Show the code
tmap_mode ("view")
plotlogWEM_e <- tm_basemap("CartoDB.Positron") +
tm_shape(origin_gridwgeom) +
tm_fill("logWEM",
style = "equal",
palette = "Blues",
title = "Total passenger trips",
alpha=0.6,
id="LOC_DESC")
plotlogWEM_e

Equal interval classification highlighted the following bus stops as having the highest ridership during the weekend/holiday morning peak period: Harbourfront/VivoCity, Tiong Bahru Station, Orchard Station/Lucky Plaza, Admiralty Station, and Aft. Punggol Road.
Show the code
tmap_mode ("view")
plotlogWEM_j <- tm_basemap("CartoDB.Positron") +
tm_shape(origin_gridwgeom) +
tm_fill("logWEM",
style = "jenks",
palette = "Blues",
title = "Total passenger trips",
alpha=0.6,
id="LOC_DESC")
plotlogWEM_j

The findings from the Jenks method are similar to the quantile classification.
4.2.4 Weekend/Holiday Evening Peak Period
Show the code
tmap_mode ("view")
plotlogWEE_q <- tm_basemap("CartoDB.Positron") +
tm_shape(origin_gridwgeom) +
tm_fill("logWEE",
style = "quantile",
palette = "Reds",
title = "Total passenger trips",
alpha=0.6,
id="LOC_DESC")
plotlogWEE_q

On weekend/holiday evenings, there seemed to be increased ridership at the bus stops near Changi Airport compared to the other peak periods.
Sunshine Place remains one of the most frequented bus stops, exhibiting high ridership across all four peak periods. While the exact reason for this is difficult to pinpoint, it’s possible that the buses stopping here serve a wide variety of regions, which could explain the high ridership.
Woodlands Checkpoint also demonstrated high levels of ridership across the different peak periods.
Visually, there appear to be more bus stops with high ridership across Singapore during the weekend/holiday evening peak period. For example, there seems to be an increase in passenger volume at the Tanah Merah Ferry compared to the other peak periods.
Show the code
tmap_mode ("view")
plotlogWEE_e <- tm_basemap("CartoDB.Positron") +
tm_shape(origin_gridwgeom) +
tm_fill("logWEE",
style = "equal",
palette = "Reds",
title = "Total passenger trips",
alpha=0.6,
id="LOC_DESC")
plotlogWEE_e
The bus stops with higher traffic appear quite consistent across the different peak periods, including Woodlands Checkpoint, Kranji Station, Admiralty Station, Serangoon Station, Aft. Punggol Road, Bukit Panjang MRT and Yio Chu Kang Interchange.
Show the code
tmap_mode ("view")
plotlogWEE_j <- tm_basemap("CartoDB.Positron") +
tm_shape(origin_gridwgeom) +
tm_fill("logWEE",
style = "jenks",
palette = "Reds",
title = "Total passenger trips",
alpha=0.6,
id="LOC_DESC")
plotlogWEE_j
The findings from Jenks' method are similar to those of the quantile classification.
5 Spatial Association Analysis
According to Tobler’s First Law of Geography, “everything is related to everything else, but near things are more related than distant things.”
This sub-section will cover the computation of local spatial autocorrelation statistics and the complete spatial randomness test for local spatial autocorrelation. The goal of these analyses is to understand whether there are clusters or outliers of bus stops with high or low ridership across Singapore.
5.1 Global Spatial Autocorrelation
5.1.1 Spatial Weights Matrix
To compute the local spatial autocorrelation statistics, we first need to construct a spatial weights matrix of Singapore. Spatial relationships are multi-directional and multi-lateral; spatial weights define which spatial units are neighbours of one another.
There are two common methods of defining spatial weights: contiguity-based and distance-based.
Contiguity-based: Neighbours share a common boundary, which can be further distinguished by the Rook and Queen criteria of contiguity. Rook contiguity defines neighbours by the existence of a common edge between two spatial units, while Queen contiguity defines neighbours as spatial units sharing a common edge or a common vertex.
Distance-based: Assigns higher weights to pairs of locations that are closer to each other and lower weights to pairs that are further apart. This can be further distinguished into fixed, adaptive and inverse-distance weighting schemes. The fixed weighting scheme considers two regions to be neighbours if they are within a specified distance of one another. Under the adaptive weighting scheme, each region has the same number of neighbours, specified beforehand as k. The inverse-distance method assumes that the closer two features are in space, the more likely they are to interact with or influence each other.
For this study, we will use a distance-based weight matrix, as there are areas where bus stops are sparse (such as Lim Chu Kang and Mandai) or isolated (for example, Tanah Merah Ferry, Changi Naval Base, Resorts World Sentosa, Marina Bay Cruise Centre). Consequently, a contiguity-based matrix may yield many regions with no neighbours, making it unsuitable for our analysis.
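As a toy illustration of the inverse-distance scheme described above (base R only, made-up coordinates, not the study data), closer pairs of locations receive larger weights:

```r
# Toy sketch of inverse-distance weighting with alpha = 1, i.e. w = 1/d;
# the three point positions are invented for illustration.
pos <- c(0, 1, 3)                 # three locations on a line
d   <- as.matrix(dist(pos))      # pairwise distance matrix
w   <- ifelse(d > 0, 1 / d, 0)   # closer pairs -> larger weights; self-weight = 0
w[1, 2]   # 1     (distance 1)
w[1, 3]   # 0.333 (distance 3)
```

With alpha = 1 the weight is simply 1/d; larger alpha values make the weight decay faster with distance.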
Step 1: Determine Cut-Off Distance Limit
First step is to determine the upper distance limit to ensure that each hexagon has at least 1 neighbour.
The following functions can be used:
- st_knn() of sfdep is used to identify neighbors based on k (e.g. k = 8 indicates the nearest eight neighbours). The output is a neighbours list of class nb. If polygon geometry is provided, the centroids of the polygon will be used for calculation.
- st_nb_dists() of sfdep is used to calculate the nearest neighbour distance. The output is a list of distances for each observation’s neighbors list.
- unlist() of Base R is then used to return the output as a vector so that the summary statistics of the nearest neighbour distances can be derived.
geo <- sf::st_geometry(origin_gridwgeom)
nb <- st_knn(geo,
longlat = TRUE)
dists <- unlist(st_nb_dists(geo, nb))
Step 2: Derive Summary Stats
We can derive summary statistics of the nearest neighbour distances vector (i.e. dists) by using the code chunk below.
summary(dists)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    0.00    0.00   22.89    0.00 1000.00
The maximum nearest neighbour distance is 1000m, thus we will use threshold value of 1001m to ensure each spatial unit has a minimum of 1 neighbour.
Step 3: Compute fixed distance weight
Now we will compute the fixed distance weights using the following functions:
st_dist_band() of sfdep is used to identify neighbours based on a distance band (i.e. an upper bound of 1001m). The output is a list of neighbours (i.e. nb). st_weights() is then used to calculate polygon spatial weights of the nb list. Note that the default style argument is set to "W" for row-standardised weights, and the default allow_zero is set to TRUE, which assigns zero as the lagged value to zones without neighbours.
wm_fd <- origin_gridwgeom %>%
mutate(nb = st_dist_band(geo,
upper = 1001),
wt = st_weights(nb),
.before = 1)
Step 4: Observations
Show the code
#kable(head(wm_fd,5))
summary(wm_fd$nb)
Neighbour list object:
Number of regions: 5022
Number of nonzero links: 266698
Percentage nonzero weights: 1.057466
Average number of links: 53.10593
Link number distribution:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
3 6 3 4 7 6 20 19 20 13 24 13 21 29 20 35 28 40 34 40
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
24 52 45 33 39 48 26 44 31 30 31 52 48 39 35 55 48 49 83 55
41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
74 67 59 73 76 83 96 81 59 53 101 136 87 117 101 84 111 79 134 127
61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
114 150 96 113 103 103 106 81 94 107 61 73 92 55 41 64 61 73 42 31
81 82 83 84 85 86 87 88 89 90 92 93 94 95 96 97 98 99 100 102
40 32 9 30 30 26 9 19 14 21 6 9 12 15 8 4 4 6 13 5
3 least connected regions:
2274 5021 5022 with 1 link
5 most connected regions:
3699 3700 3701 3702 3703 with 102 links
From the result above, we can confirm that all hexagons have at least one neighbour, and that 5 hexagons have as many as 102 neighbours. On average, each hexagon has 53 neighbours under this distance-based weight matrix.
A characteristic of the fixed distance weights matrix is that more densely settled areas (towns, residential neighbourhoods) tend to have more neighbours, while less densely settled areas (military camps, industrial estates) tend to have fewer. To overcome this uneven distribution of neighbours, we can directly control the number of neighbours using k-nearest neighbours by setting the value of k in the code chunk below.
As a rule-of-thumb, we will set k = 8 i.e., all hexagons will have 8 neighbours.
wm_ad <- origin_gridwgeom %>%
mutate(nb = st_knn(geo,
k=8),
wt = st_weights(nb),
.before = 1)
head(wm_ad, n=3)
Simple feature collection with 3 features and 16 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: 3720.122 ymin: 27925.48 xmax: 4720.122 ymax: 31100.9
Projected CRS: SVY21 / Singapore TM
nb
1 2, 4, 5, 9, 10, 15, 32, 33
2 1, 4, 5, 9, 10, 15, 32, 33
3 5, 6, 7, 11, 12, 16, 17, 18
wt grid_id busstops.x
1 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125 34 1
2 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125 65 1
3 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125 99 1
ORIGIN_BS WDA WDM WEE WEM logWDM logWDA logWEM logWEE BUS_ROOF_N
1 25059 417 62 65 5 4.127134 6.033086 1.609438 4.174387 UNK
2 25751 110 50 26 24 3.912023 4.700480 3.178054 3.258097 B02D
3 26379 249 44 54 27 3.784190 5.517453 3.295837 3.988984 NIL
LOC_DESC busstops.y area_hexagon_grid
1 AFT TUAS STH BLVD 1 POLYGON ((3970.122 27925.48...
2 BEF TUAS STH AVE 14 1 POLYGON ((4220.122 28358.49...
3 YONG NAM 1 POLYGON ((4470.122 30523.55...
The results show that each neighbour has been assigned a weight of 1/8 (0.125) of the total, one equal share for each of the 8 neighbours.
Inverse distance weights take into account the decay function of distance.
We can derive spatial weight matrix based on inverse distance method using the following functions:
st_contiguity() of sfdep is used to identify the neighbours by using contiguity criteria. The output is a list of neighbours (i.e. nb).
st_inverse_distance() is then used to calculate inverse distance weights of neighbours on the nb list.
wm_idw <- origin_gridwgeom %>%
mutate(nb = st_contiguity(geo),
wts = st_inverse_distance(nb, geo,
scale = 1,
alpha = 1),
.before = 1)
summary(wm_idw$nb)
Neighbour list object:
Number of regions: 5022
Number of nonzero links: 107808
Percentage nonzero weights: 0.4274621
Average number of links: 21.46714
6 regions with no links:
1750 2274 3159 4675 4989 4994
Link number distribution:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
6 12 17 33 23 97 70 99 88 76 115 136 126 133 130 160 168 227 197 169
20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
159 174 251 240 213 195 222 210 183 187 124 115 104 110 111 39 54 56 11 17
40 41 42 43 44 48
55 62 13 17 10 8
12 least connected regions:
1 32 62 736 3354 3355 4982 4990 5002 5003 5021 5022 with 1 link
8 most connected regions:
3048 3049 3050 3051 3052 3053 3054 3055 with 48 links
Using the inverse distance method resulted in 6 regions with no neighbours; this could be due to the spatial isolation of certain hexagons.
In summary:
- The number of neighbours under the fixed distance method varies widely, from 1 to 102. This uneven distribution could affect the spatial autocorrelation analysis.
- The inverse distance method led to regions with no neighbours, and is more computationally intensive as a separate weight must be derived from each neighbour's distance.
- Since each hexagon is equally sized, the adaptive distance-based spatial weight matrix is best suited for our analysis, as each centroid can represent its region well.
centroid <- st_centroid(origin_gridwgeom)
plot(origin_gridwgeom$area_hexagon_grid, border = "lightgrey")
#plot(wm_idw, centroid, pch = 19, cex = 0.1, add = TRUE, col = "red")
5.2 Computing Global Spatial Autocorrelation Statistics
5.2.1 Moran’s I
We will perform Moran's I statistical testing by using global_moran_perm() of sfdep. The Global Moran's I permutation test is a statistical method used in spatial analysis to assess the significance of spatial autocorrelation in a dataset. Spatial autocorrelation refers to the degree to which a variable is correlated with itself across space, indicating patterns such as clustering or dispersion.
The Moran's I statistic ranges from -1 to 1. If Moran's I is:
- positive (I > 0): clustered; observations tend to be similar
- negative (I < 0): dispersed; observations tend to be dissimilar
- approximately zero: observations are arranged randomly over space
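To make the interpretation concrete, the sketch below computes Moran's I directly from its definition, I = (n/S0) · Σij wij zi zj / Σi zi², in base R on a made-up six-cell lattice; the actual testing below uses global_moran_perm() with Monte Carlo simulation instead.

```r
# Moran's I from first principles on a toy lattice (not the study data).
moran_i <- function(x, W) {
  z <- x - mean(x)                                    # centred values
  (length(x) / sum(W)) * sum(W * outer(z, z)) / sum(z^2)
}

W <- matrix(0, 6, 6)
for (i in 1:5) { W[i, i + 1] <- 1; W[i + 1, i] <- 1 } # neighbours: adjacent cells
W <- W / rowSums(W)                                   # row-standardised ("W" style)

moran_i(c(1, 2, 3, 4, 5, 6), W)   # smoothly rising values -> positive I (clustered)
moran_i(c(1, 6, 1, 6, 1, 6), W)   # alternating values -> negative I (dispersed)
```

The first call returns roughly 0.71 and the second returns -1, matching the clustered and dispersed cases described above.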
The code chunk below performs Moran’s I statistical testing, using the null and alternative hypotheses as follows:
H0: The observed spatial patterns of bus ridership in Singapore are not clustered (i.e. either random or dispersed).
H1: The observed spatial patterns of bus ridership in Singapore are clustered.
A total of 100 simulations will be performed with seed number 1234. The set.seed() function allows us to create reproducible results.
Note: for the nsim argument of global_moran_perm(), the number of simulations performed is nsim + 1, i.e. nsim = 99 performs 100 simulations.
set.seed(1234)
Show the code
gmp_WDM <- global_moran_perm(wm_ad$WDM,
wm_ad$nb,
wm_ad$wt,
nsim = 99)
gmp_WDM
Monte-Carlo simulation of Moran I
data: x
weights: listw
number of simulations + 1: 100
statistic = 0.094609, observed rank = 100, p-value < 2.2e-16
alternative hypothesis: two.sided
Show the code
gmp_WDA <- global_moran_perm(wm_ad$WDA,
wm_ad$nb,
wm_ad$wt,
nsim = 99)
gmp_WDA
Monte-Carlo simulation of Moran I
data: x
weights: listw
number of simulations + 1: 100
statistic = 0.063584, observed rank = 100, p-value < 2.2e-16
alternative hypothesis: two.sided
Show the code
gmp_WEM <- global_moran_perm(wm_ad$WEM,
wm_ad$nb,
wm_ad$wt,
nsim = 99)
gmp_WEM
Monte-Carlo simulation of Moran I
data: x
weights: listw
number of simulations + 1: 100
statistic = 0.095565, observed rank = 100, p-value < 2.2e-16
alternative hypothesis: two.sided
Show the code
gmp_WEE <- global_moran_perm(wm_ad$WEE,
wm_ad$nb,
wm_ad$wt,
nsim = 99)
gmp_WEE
Monte-Carlo simulation of Moran I
data: x
weights: listw
number of simulations + 1: 100
statistic = 0.083812, observed rank = 100, p-value < 2.2e-16
alternative hypothesis: two.sided
Across the 4 peak periods, the permutation tests generated p-values below 0.05. This indicates that we can reject the null hypothesis at the 95% confidence level and conclude that, for each of the 4 peak periods, overall bus ridership across Singapore is spatially clustered (since a positive Moran's I value is obtained).
5.2.2 Geary’s C
5.2.3 Spatial Correlogram
5.3 Local Spatial Autocorrelation Statistics
Global spatial autocorrelation provides a broad overview of spatial clustering within a dataset, offering a single value that indicates whether similar values are generally clustered or dispersed across the entire study area. In contrast, local spatial autocorrelation delves into specific locations, identifying where clusters of similar values (hot spots or cold spots) or spatial outliers exist. While global metrics give an overall trend, local metrics provide detailed, location-specific insights, highlighting exact areas of significant spatial clustering or anomaly.
Thus, having established through statistical testing that spatial clustering of bus ridership occurs in Singapore, we now seek to detect clusters or outliers, and to discover whether there are any hot or cold spots of ridership, using local spatial autocorrelation statistics.
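Before running local_moran(), a minimal base-R sketch of what the local statistic measures may help: each location i receives its own value Ii = (zi / m2) · Σj wij zj, with m2 = Σ zi² / n, comparing its centred value with the weighted average of its neighbours. The lattice and values below are made up for illustration.

```r
# Local Moran statistic from its definition on a toy 4-cell lattice
# (not the study data; the analysis below uses local_moran() of sfdep).
local_moran_i <- function(x, W) {
  z  <- x - mean(x)
  m2 <- sum(z^2) / length(x)
  (z / m2) * as.vector(W %*% z)   # one statistic per location
}

W <- matrix(0, 4, 4)
for (i in 1:3) { W[i, i + 1] <- 1; W[i + 1, i] <- 1 }  # adjacent-cell neighbours
W <- W / rowSums(W)

local_moran_i(c(10, 9, 1, 2), W)  # positive at both ends: a high-high pair and a low-low pair
```

Locations whose values agree with their neighbours (high-high or low-low) get positive Ii; locations unlike their neighbours (spatial outliers) get negative Ii.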
5.3.1 Local Moran's I
In this section, we will perform local Moran's I statistical testing by using local_moran() of sfdep. The output of local_moran() is an sf data.frame containing the columns below:
- ii: local Moran statistic
- eii: expectation of the local Moran's I statistic
- var_ii: variance of the local Moran's I statistic
- z_ii: standard deviate of the local Moran's I statistic
- p_ii: p-value of the local Moran's I statistic using pnorm()
- p_ii_sim: for localmoran_perm(), rank() and punif() of the observed statistic rank for [0, 1] p-values using alternative=
- p_folded_sim: the simulation folded [0, 0.5] range ranked p-value, based on crand.py of pysal
- skewness: for localmoran_perm(), the output of e1071::skewness() for the permutation samples underlying the standard deviates
- kurtosis: for localmoran_perm(), the output of e1071::kurtosis() for the permutation samples underlying the standard deviates
unnest() of tidyr package helps expand a list-column containing data frames into rows and columns.
Show the code
lisa_WDM <- wm_ad %>%
mutate(local_moran = local_moran(
origin_gridwgeom$WDM, nb, wt, nsim = 99),
.before = 1) %>%
unnest(local_moran)
head(lisa_WDM, n=5)
Simple feature collection with 5 features and 28 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: 3720.122 ymin: 27925.48 xmax: 4970.122 ymax: 31100.9
Projected CRS: SVY21 / Singapore TM
# A tibble: 5 × 29
ii eii var_ii z_ii p_ii p_ii_sim p_folded_sim skewness kurtosis
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.190 0.00735 0.0204 1.28 0.201 0.02 0.01 -2.55 8.37
2 0.190 -0.0100 0.0433 0.962 0.336 0.02 0.01 -5.72 42.9
3 0.186 0.00312 0.0459 0.853 0.394 0.02 0.01 -6.21 49.1
4 0.181 0.0166 0.0142 1.38 0.167 0.02 0.01 -2.06 5.79
5 0.167 0.00919 0.0118 1.45 0.147 0.02 0.01 -2.18 7.71
# ℹ 20 more variables: mean <fct>, median <fct>, pysal <fct>, nb <nb>,
# wt <list>, grid_id <int>, busstops.x <int>, ORIGIN_BS <chr>, WDA <dbl>,
# WDM <dbl>, WEE <dbl>, WEM <dbl>, logWDM <dbl>, logWDA <dbl>, logWEM <dbl>,
# logWEE <dbl>, BUS_ROOF_N <chr>, LOC_DESC <chr>, busstops.y <int>,
# area_hexagon_grid <POLYGON [m]>
lisa_WDA <- wm_ad %>%
mutate(local_moran = local_moran(
origin_gridwgeom$WDA, nb, wt, nsim = 99),
.before = 1) %>%
unnest(local_moran)
head(lisa_WDA, n=5)
Simple feature collection with 5 features and 28 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: 3720.122 ymin: 27925.48 xmax: 4970.122 ymax: 31100.9
Projected CRS: SVY21 / Singapore TM
# A tibble: 5 × 29
ii eii var_ii z_ii p_ii p_ii_sim p_folded_sim skewness kurtosis
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.0752 0.00345 0.00381 1.16 0.245 0.02 0.01 -1.92 4.44
2 0.0795 0.00807 0.00673 0.870 0.384 0.04 0.02 -3.02 10.4
3 0.0734 0.00141 0.00811 0.799 0.424 0.08 0.04 -2.87 9.31
4 0.0567 0.0145 0.00114 1.25 0.212 0.02 0.01 -3.00 12.6
5 0.0415 0.00794 0.000767 1.21 0.225 0.02 0.01 -2.80 11.0
# ℹ 20 more variables: mean <fct>, median <fct>, pysal <fct>, nb <nb>,
# wt <list>, grid_id <int>, busstops.x <int>, ORIGIN_BS <chr>, WDA <dbl>,
# WDM <dbl>, WEE <dbl>, WEM <dbl>, logWDM <dbl>, logWDA <dbl>, logWEM <dbl>,
# logWEE <dbl>, BUS_ROOF_N <chr>, LOC_DESC <chr>, busstops.y <int>,
# area_hexagon_grid <POLYGON [m]>
lisa_WEM <- wm_ad %>%
mutate(local_moran = local_moran(
origin_gridwgeom$WEM, nb, wt, nsim = 99),
.before = 1) %>%
unnest(local_moran)
head(lisa_WEM, n=5)
Simple feature collection with 5 features and 28 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: 3720.122 ymin: 27925.48 xmax: 4970.122 ymax: 31100.9
Projected CRS: SVY21 / Singapore TM
# A tibble: 5 × 29
ii eii var_ii z_ii p_ii p_ii_sim p_folded_sim skewness kurtosis
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.143 0.000449 0.0154 1.15 0.252 0.04 0.02 -2.77 8.66
2 0.141 0.0126 0.00812 1.43 0.154 0.06 0.03 -0.884 0.201
3 0.130 -0.0145 0.0446 0.684 0.494 0.02 0.01 -3.95 18.0
4 0.124 0.00189 0.0111 1.16 0.247 0.04 0.02 -2.96 11.3
5 0.115 -0.00534 0.0161 0.948 0.343 0.02 0.01 -5.03 33.2
# ℹ 20 more variables: mean <fct>, median <fct>, pysal <fct>, nb <nb>,
# wt <list>, grid_id <int>, busstops.x <int>, ORIGIN_BS <chr>, WDA <dbl>,
# WDM <dbl>, WEE <dbl>, WEM <dbl>, logWDM <dbl>, logWDA <dbl>, logWEM <dbl>,
# logWEE <dbl>, BUS_ROOF_N <chr>, LOC_DESC <chr>, busstops.y <int>,
# area_hexagon_grid <POLYGON [m]>
lisa_WEE <- wm_ad %>%
mutate(local_moran = local_moran(
origin_gridwgeom$WEE, nb, wt, nsim = 99),
.before = 1) %>%
unnest(local_moran)
head(lisa_WEE, n=5)
Simple feature collection with 5 features and 28 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: 3720.122 ymin: 27925.48 xmax: 4970.122 ymax: 31100.9
Projected CRS: SVY21 / Singapore TM
# A tibble: 5 × 29
ii eii var_ii z_ii p_ii p_ii_sim p_folded_sim skewness kurtosis
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.0904 -0.0160 0.0277 0.639 0.523 0.04 0.02 -4.96 27.3
2 0.0924 -0.00619 0.0143 0.824 0.410 0.02 0.01 -2.83 8.51
3 0.0809 0.00535 0.00947 0.777 0.437 0.14 0.07 -2.66 7.65
4 0.0756 0.000552 0.00974 0.760 0.447 0.02 0.01 -4.97 30.6
5 0.0637 0.000789 0.00272 1.20 0.228 0.04 0.02 -2.01 4.58
# ℹ 20 more variables: mean <fct>, median <fct>, pysal <fct>, nb <nb>,
# wt <list>, grid_id <int>, busstops.x <int>, ORIGIN_BS <chr>, WDA <dbl>,
# WDM <dbl>, WEE <dbl>, WEM <dbl>, logWDM <dbl>, logWDA <dbl>, logWEM <dbl>,
# logWEE <dbl>, BUS_ROOF_N <chr>, LOC_DESC <chr>, busstops.y <int>,
# area_hexagon_grid <POLYGON [m]>
6 Limitations
Bus ridership patterns should be analysed in conjunction with other modes of public transportation to obtain a more complete picture of urban mobility.